{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f73e7a07-98a4-4927-9afe-aaab97592564",
   "metadata": {},
   "source": [
    "# L9: Evaluation Part II\n",
    "\n",
    "Evaluate LLM responses where there isn't a single \"right answer.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0dc3130-511d-4830-8066-e73c9261c452",
   "metadata": {},
   "source": [
    "## Setup\n",
    "#### Load the API key and relevant Python libaries.\n",
    "In this course, we've provided some code that loads the OpenAI API key for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7f1afcb",
   "metadata": {
    "height": 165
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import openai\n",
    "import sys\n",
    "sys.path.append('../..')\n",
    "import utils\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "_ = load_dotenv(find_dotenv()) # read local .env file\n",
    "\n",
    "openai.api_key  = os.environ['OPENAI_API_KEY']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63e6f40d",
   "metadata": {
    "height": 165
   },
   "outputs": [],
   "source": [
    "def get_completion_from_messages(messages, model=\"gpt-3.5-turbo\", temperature=0, max_tokens=500):\n",
    "    response = openai.ChatCompletion.create(\n",
    "        model=model,\n",
    "        messages=messages,\n",
    "        temperature=temperature, \n",
    "        max_tokens=max_tokens, \n",
    "    )\n",
    "    return response.choices[0].message[\"content\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc82619-ecfe-4b5e-b8d4-1db03db11fc1",
   "metadata": {},
   "source": [
    "### Run through the end-to-end system to answer the user query\n",
    "\n",
    "These helper functions are running the chain of promopts that you saw in the earlier videos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3eae3d2",
   "metadata": {
    "height": 182
   },
   "outputs": [],
   "source": [
    "customer_msg = f\"\"\"\n",
    "tell me about the smartx pro phone and the fotosnap camera, the dslr one.\n",
    "Also, what TVs or TV related products do you have?\"\"\"\n",
    "\n",
    "products_by_category = utils.get_products_from_query(customer_msg)\n",
    "category_and_product_list = utils.read_string_to_list(products_by_category)\n",
    "product_info = utils.get_mentioned_product_info(category_and_product_list)\n",
    "assistant_answer = utils.answer_user_msg(user_msg=customer_msg,\n",
    "                                                   product_info=product_info)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a5e117e",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "print(assistant_answer) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1411a11-c7e3-4124-983d-149eeb2a34b1",
   "metadata": {},
   "source": [
    "### Evaluate the LLM's answer to the user with a rubric, based on the extracted product information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "573f090d",
   "metadata": {
    "height": 80
   },
   "outputs": [],
   "source": [
    "cust_prod_info = {\n",
    "    'customer_msg': customer_msg,\n",
    "    'context': product_info\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "395cbf45",
   "metadata": {
    "height": 845
   },
   "outputs": [],
   "source": [
    "def eval_with_rubric(test_set, assistant_answer):\n",
    "\n",
    "    cust_msg = test_set['customer_msg']\n",
    "    context = test_set['context']\n",
    "    completion = assistant_answer\n",
    "    \n",
    "    system_message = \"\"\"\\\n",
    "    You are an assistant that evaluates how well the customer service agent \\\n",
    "    answers a user question by looking at the context that the customer service \\\n",
    "    agent is using to generate its response. \n",
    "    \"\"\"\n",
    "\n",
    "    user_message = f\"\"\"\\\n",
    "You are evaluating a submitted answer to a question based on the context \\\n",
    "that the agent uses to answer the question.\n",
    "Here is the data:\n",
    "    [BEGIN DATA]\n",
    "    ************\n",
    "    [Question]: {cust_msg}\n",
    "    ************\n",
    "    [Context]: {context}\n",
    "    ************\n",
    "    [Submission]: {completion}\n",
    "    ************\n",
    "    [END DATA]\n",
    "\n",
    "Compare the factual content of the submitted answer with the context. \\\n",
    "Ignore any differences in style, grammar, or punctuation.\n",
    "Answer the following questions:\n",
    "    - Is the Assistant response based only on the context provided? (Y or N)\n",
    "    - Does the answer include information that is not provided in the context? (Y or N)\n",
    "    - Is there any disagreement between the response and the context? (Y or N)\n",
    "    - Count how many questions the user asked. (output a number)\n",
    "    - For each question that the user asked, is there a corresponding answer to it?\n",
    "      Question 1: (Y or N)\n",
    "      Question 2: (Y or N)\n",
    "      ...\n",
    "      Question N: (Y or N)\n",
    "    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)\n",
    "\"\"\"\n",
    "\n",
    "    messages = [\n",
    "        {'role': 'system', 'content': system_message},\n",
    "        {'role': 'user', 'content': user_message}\n",
    "    ]\n",
    "\n",
    "    response = get_completion_from_messages(messages)\n",
    "    return response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbc4af8e",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)\n",
    "print(evaluation_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "836eb97c-025b-4643-823f-7fd3fac6ce20",
   "metadata": {},
   "source": [
    "### Evaluate the LLM's answer to the user based on an \"ideal\" / \"expert\" (human generated) answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2805f188",
   "metadata": {
    "height": 879
   },
   "outputs": [],
   "source": [
    "test_set_ideal = {\n",
    "    'customer_msg': \"\"\"\\\n",
    "tell me about the smartx pro phone and the fotosnap camera, the dslr one.\n",
    "Also, what TVs or TV related products do you have?\"\"\",\n",
    "    'ideal_answer':\"\"\"\\\n",
    "Of course!  The SmartX ProPhone is a powerful \\\n",
    "smartphone with advanced camera features. \\\n",
    "For instance, it has a 12MP dual camera. \\\n",
    "Other features include 5G wireless and 128GB storage. \\\n",
    "It also has a 6.1-inch display.  The price is $899.99.\n",
    "\n",
    "The FotoSnap DSLR Camera is great for \\\n",
    "capturing stunning photos and videos. \\\n",
    "Some features include 1080p video, \\\n",
    "3-inch LCD, a 24.2MP sensor, \\\n",
    "and interchangeable lenses. \\\n",
    "The price is 599.99.\n",
    "\n",
    "For TVs and TV related products, we offer 3 TVs \\\n",
    "\n",
    "\n",
    "All TVs offer HDR and Smart TV.\n",
    "\n",
    "The CineView 4K TV has vibrant colors and smart features. \\\n",
    "Some of these features include a 55-inch display, \\\n",
    "'4K resolution. It's priced at 599.\n",
    "\n",
    "The CineView 8K TV is a stunning 8K TV. \\\n",
    "Some features include a 65-inch display and \\\n",
    "8K resolution.  It's priced at 2999.99\n",
    "\n",
    "The CineView OLED TV lets you experience vibrant colors. \\\n",
    "Some features include a 55-inch display and 4K resolution. \\\n",
    "It's priced at 1499.99.\n",
    "\n",
    "We also offer 2 home theater products, both which include bluetooth.\\\n",
    "The SoundMax Home Theater is a powerful home theater system for \\\n",
    "an immmersive audio experience.\n",
    "Its features include 5.1 channel, 1000W output, and wireless subwoofer.\n",
    "It's priced at 399.99.\n",
    "\n",
    "The SoundMax Soundbar is a sleek and powerful soundbar.\n",
    "It's features include 2.1 channel, 300W output, and wireless subwoofer.\n",
    "It's priced at 199.99\n",
    "\n",
    "Are there any questions additional you may have about these products \\\n",
    "that you mentioned here?\n",
    "Or may do you have other questions I can help you with?\n",
    "    \"\"\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67cceac4-0d92-48db-99de-4d6776ba96e8",
   "metadata": {},
   "source": [
    "### Check if the LLM's response agrees with or disagrees with the expert answer\n",
    "\n",
    "This evaluation prompt is from the [OpenAI evals](https://github.com/openai/evals/blob/main/evals/registry/modelgraded/fact.yaml) project.\n",
    "\n",
    "[BLEU score](https://en.wikipedia.org/wiki/BLEU): another way to evaluate whether two pieces of text are similar or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "596db8b9",
   "metadata": {
    "height": 726
   },
   "outputs": [],
   "source": [
    "def eval_vs_ideal(test_set, assistant_answer):\n",
    "\n",
    "    cust_msg = test_set['customer_msg']\n",
    "    ideal = test_set['ideal_answer']\n",
    "    completion = assistant_answer\n",
    "    \n",
    "    system_message = \"\"\"\\\n",
    "    You are an assistant that evaluates how well the customer service agent \\\n",
    "    answers a user question by comparing the response to the ideal (expert) response\n",
    "    Output a single letter and nothing else. \n",
    "    \"\"\"\n",
    "\n",
    "    user_message = f\"\"\"\\\n",
    "You are comparing a submitted answer to an expert answer on a given question. Here is the data:\n",
    "    [BEGIN DATA]\n",
    "    ************\n",
    "    [Question]: {cust_msg}\n",
    "    ************\n",
    "    [Expert]: {ideal}\n",
    "    ************\n",
    "    [Submission]: {completion}\n",
    "    ************\n",
    "    [END DATA]\n",
    "\n",
    "Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.\n",
    "    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:\n",
    "    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.\n",
    "    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.\n",
    "    (C) The submitted answer contains all the same details as the expert answer.\n",
    "    (D) There is a disagreement between the submitted answer and the expert answer.\n",
    "    (E) The answers differ, but these differences don't matter from the perspective of factuality.\n",
    "  choice_strings: ABCDE\n",
    "\"\"\"\n",
    "\n",
    "    messages = [\n",
    "        {'role': 'system', 'content': system_message},\n",
    "        {'role': 'user', 'content': user_message}\n",
    "    ]\n",
    "\n",
    "    response = get_completion_from_messages(messages)\n",
    "    return response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5791815d",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "print(assistant_answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7790cae",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "eval_vs_ideal(test_set_ideal, assistant_answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0ff694a",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "assistant_answer_2 = \"life is like a box of chocolates\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1de6e039",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "eval_vs_ideal(test_set_ideal, assistant_answer_2)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
